Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes, which should enable modification towards a style without changing the semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile") while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation. This entire process only involves optimizing over around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes, with the performance outperforming diffusion-model-based image-editing algorithms that require fine-tuning. The optimized weights generalize well to different images. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement.
translated by 谷歌翻译
Recently, the success of pre-training in text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities are showing a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is the modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all of common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new one. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
translated by 谷歌翻译
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose the participants to design an efficient quantized image super-resolution solution that can demonstrate a real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
translated by 谷歌翻译
电力电子转换器已被广泛用于航空航天系统,直流传输,分布式能源,智能电网等,电源电子转换器的可靠性一直是学术界和行业的热点。执行电力电子转换器开放电路故障和智能故障诊断以避免次要故障,减少操作和维护成本,并提高电力电子系统的可靠性,这一点很重要。首先,分析和总结电力电子转换器的故障特征。其次,对电源电子转换器中的一些基于AI的故障诊断方法和应用示例进行了审查,并提出了基于随机森林和瞬态故障特征的故障诊断方法,用于三相功率电子转换器。最后,指出了未来的研究挑战和基于AI的故障诊断方法的方向。
translated by 谷歌翻译
最新的工业推理引擎(例如FASTRASTRANSFORMER1和TURBOTTRANSFORMER)已验证了半精度的浮点(FP16)和8位整数(INT8)量化可以极大地提高模型推断速度。但是,现有的FP16或INT8量化方法太复杂了,使用不当将大大导致性能损害。在本文中,我们开发了一个工具包,供用户轻松量化其模型以进行推理,其中提出了自适应混合精液(SAMP),以通过混合精确体系结构自动控制量化率,以平衡效率和性能。实验结果表明,我们的SAMP工具包比Pytorch和Fertransformer具有更高的速度,同时确保了所需的性能。此外,SAMP基于模块化设计,将令牌,嵌入,编码器和目标层解耦,该层允许用户处理各种下游任务,并且可以将其无缝集成到Pytorch中。
translated by 谷歌翻译
先前的研究证明,跨语性知识蒸馏可以显着提高预训练模型的跨语义相似性匹配任务的性能。但是,在此操作中,学生模型必须大。否则,其性能将急剧下降,从而使部署到内存限制设备的不切实际。为了解决这个问题,我们深入研究了跨语言知识蒸馏,并提出了一个多阶段蒸馏框架,用于构建一个小型但高性能的跨语性模型。在我们的框架中,合并了对比度学习,瓶颈和参数复发策略,以防止在压缩过程中损害性能。实验结果表明,我们的方法可以压缩XLM-R和Minilm的大小超过50 \%,而性能仅降低约1%。
translated by 谷歌翻译
科学文献是高质量的语料库,支持大量自然语言处理(NLP)研究。但是,现有数据集围绕英语,这限制了中国科学NLP的发展。在这项工作中,我们提出了CSL,这是一个大规模的中国科学文献数据集,其中包含396K论文的标题,摘要,关键字和学术领域。据我们所知,CSL是中文中的第一个科学文档数据集。 CSL可以用作中国语料库。同样,该半结构化数据是一种自然注释,可以构成许多监督的NLP任务。基于CSL,我们提出了一个基准,以评估跨科学领域任务的模型的性能,即摘要,关键字生成和文本分类。我们分析了现有文本到文本模型在评估任务上的行为,并揭示了中国科学NLP任务的挑战,该任务为未来的研究提供了宝贵的参考。数据和代码可在https://github.com/ydli-ai/csl上找到
translated by 谷歌翻译
使用增强现实(AR)用于导航目的,这表明在手术手术过程中协助医生有益。这些应用通常需要知道外科手术工具和患者的姿势,以提供外科医生在任务执行过程中可以使用的视觉信息。现有的医学级跟踪系统使用放置在手术室内的红外摄像头(OR)来识别感兴趣的对象附加并计算其姿势的复古反射标记。一些市售的AR头式显示器(HMD)使用类似的摄像头进行自定位,手动跟踪和估算对象的深度。这项工作提出了一个使用AR HMD的内置摄像机来准确跟踪复古反射标记的框架,例如在手术过程中使用的标记,而无需集成任何其他组件。该框架还能够同时跟踪多个工具。我们的结果表明,横向翻译的准确度为0.09 +-0.06毫米,可以实现标记的跟踪和检测,纵向翻译的0.42 +-0.32 mm,绕垂直轴旋转的0.80 +-0.39 ver。此外,为了展示所提出的框架的相关性,我们在手术程序的背景下评估了系统的性能。该用例旨在在骨科过程中复制K-Wire插入的场景。为了进行评估,为两名外科医生和一名生物医学研究人员提供了视觉导航,每次都进行了21次注射。该用例的结果提供了与基于AR的导航程序报告的相当精度。
translated by 谷歌翻译
创伤性脑损伤(TBI)患者的脑网络分析对于其意识水平评估和预后评估至关重要,这需要分割某些意识相关的大脑区域。但是,由于很难收集TBI患者的手动注释的MR扫描,因此很难构建TBI分割模型。数据增强技术可用于缓解数据稀缺问题。但是,常规数据增强策略(例如空间和强度转化)无法模仿创伤性大脑中的变形和病变,这限制了后续分割任务的性能。为了解决这些问题,我们提出了一种名为TBIGA的新型医学图像授课模型,以通过配对的脑标签图合成TBI MR扫描。我们的TBIGAN方法的主要优势在于,它可以同时生成TBI图像和相应的标签映射,这在以前的医学图像的先前涂上方法中尚未实现。我们首先按照粗到细节的方式在边缘信息的指导下生成成分的图像,然后将合成强度图像用作标签上填充的先验。此外,我们引入了基于注册的模板增强管道,以增加合成图像对的多样性并增强数据增强能力。实验结果表明,提出的TBIGAN方法可以产生具有高质量和有效标签图的足够合成的TBI图像,这可以大大改善与替代方案相比的2D和3D创伤性脑部分割性能。
translated by 谷歌翻译
最近,对抗机器学习攻击对实用音频信号分类系统构成了严重的安全威胁,包括语音识别,说话者识别和音乐版权检测。先前的研究主要集中在确保通过在原始信号上产生类似小噪声的扰动来攻击音频信号分类器的有效性。目前尚不清楚攻击者是否能够创建音频信号扰动,除了其攻击效果外,人类还可以很好地看待。这对于音乐信号尤其重要,因为它们经过精心制作,具有可让人的音频特征。在这项工作中,我们将对音乐信号的对抗性攻击作为一种新的感知攻击框架,将人类研究纳入对抗性攻击设计中。具体而言,我们进行了一项人类研究,以量化人类对音乐信号的变化的看法。我们邀请人类参与者根据对原始和扰动的音乐信号对进行评分,并通过回归分析对人类感知过程进行反向工程,以预测给定信号的人类感知的偏差。然后将感知感知的攻击作为优化问题提出,该问题找到了最佳的扰动信号,以最大程度地减少对回归人类感知模型的感知偏差的预测。我们使用感知感知的框架来设计对YouTube版权探测器的现实对抗音乐攻击。实验表明,感知意识攻击会产生对抗性音乐的感知质量明显优于先前的工作。
translated by 谷歌翻译